Abstract: Effective reconstitution approaches for cyber systems 
are needed to keep critical infrastructure operational in the face 
of an intelligent adversary. The reconstitution response, including 
recovery and evolution, may require significant reconfiguration 
of the system at all levels to render the cyber-system resilient to 
ongoing and future attacks or faults while maintaining continuity 
of operations. A theoretical basis for optimal dynamic 
reconstitution is needed to address the challenge of ensuring that 
dynamic reconstitution is optimal with respect to resilience 
metrics, and is being developed and evaluated in this project. 
Such a framework provides the technical basis for evaluating 
cyber-defense and reconstitution approaches. This paper 
describes a preliminary framework that may be used to develop 
and evaluate concepts for effective autonomous reconstitution of 
compromised cyber systems. 

I. Introduction 

The ability to maintain mission-critical operations in cyber- 
systems in the face of disruptions is critical. Faults in cyber 
systems can come from accidental sources (e.g., natural failure 
of a component) or deliberate sources (e.g., an intelligent 
adversary). Natural and intentional manipulation of data, 
computing, or coordination are the most impactful ways that an 
attacker can prevent an infrastructure from realizing its mission 
goals. Under these conditions, the ability to reconstitute critical 
infrastructure becomes important. Specifically, the question is: 
Given an intelligent adversary, how can cyber systems respond 
to keep critical infrastructure operational? 

In cyber systems, the distributed nature of the system poses 
serious difficulties in maintaining operations, in part because a 
centralized command and control apparatus is unlikely to 
provide a robust framework for resilience. Resilience in cyber- 
systems, in general, has several components, and requires the 
ability to anticipate and withstand attacks or faults, as well as 
recover from faults and evolve the system to improve future 
resilience. The recovery effort and any subsequent evolution 
may require significant reconfiguration of the system at all 
levels — hardware, software, services, permissions, etc. — if the 
system is to be made resilient to further attack or faults. This is 
especially important in the case of ongoing attacks, where 
reconfiguration decisions must be taken with care to avoid 
further compromising the system while maintaining continuity 
of operations. Collectively, we will label this recovery and 
evolution process as "reconstitution." Currently, reconstitution 
is performed manually, generally after-the-fact, and usually 
consists of either standing up redundant systems, check-points 
(rolling back the configuration to a "clean" state), or re-creating 
the system using "gold-standard" copies. For enterprise 
systems, such reconstitution may be performed either directly 
on hardware, or using virtual machines. 

A significant challenge within this context is the ability to 
verify that the reconstitution is performed in a manner that 
renders the cyber-system resilient to ongoing and future attacks 
or faults. Fundamentally, the need is to determine optimal 
states of the cyber system when a fault is determined to be 
present. 

Contributions: This paper presents preliminary research 
towards concepts for effective autonomous reconstitution of 
compromised cyber systems. We describe a mathematical 
formulation as a first step towards a theoretical basis for 
autonomous reconstitution in dynamic cyber-system 
environments. We then propose formulating autonomous 
reconstitution as an optimization problem and describe some of 
the challenges associated with this formulation. This is 
followed by a brief discussion on potential solutions to these 
challenges. 

II. Background 

The quest for resilient cybersystems has seen a number of 
potential approaches, each of which attempts to add specific 
properties to the system that make it resilient to one or more 
attack vectors. Examples of these properties include diversity, 
redundancy, deception, segmentation, and unpredictability. 
One approach that incorporates some of these elements is 
moving target defense [1]. These types of methods are usually 
heuristic and can be shown using empirical approaches to 
provide some level of resilience. However, these approaches 
cannot be readily applied after a system has been 
compromised. It is also not clear whether such methods are 
applicable under any attack scenario, or if there are specific 
limitations. Fundamental to understanding the applicability of 
many of these approaches as attacks evolve is the ability to 
mathematically define cyber-systems in some manner. 
Specifically, there is a need to determine the key properties of a 
cyber-system, determine nominally safe configurations of the 
system in terms of one or more metrics, and define the 
mathematical framework that uses this information. 

As might be expected, the problem of cyber-system 
resilience has seen significant research over the past few years. 
A number of groups (MITRE, Raytheon) have proposed 
frameworks for resilience that encompass the basic ideas 
(anticipate, withstand, recover, evolve) (Fig. 1). The right side 
of the figure (anticipate, withstand) might be generally thought 
of as enabling robust design of the cyber-system, while the left half 
corresponds to reconstitution in the event of compromise. 
However, these two elements (robustness and reconstitution) 
are not independent, and leverage information from each other 
(for instance, information available during reconstitution may 
inform robust design, and reconstitution approaches may be 
constrained by robust design concepts). 

Frameworks for resilience that incorporate operational 
aspects have also been proposed. The authors of [2] propose an 
operational framework for resilience but acknowledge the 
difficulties in a practical implementation. The authors of [3] discuss 
the use of probabilistic risk assessment as a tool for 
understanding and improving resilience. 

III. Related Work 

Prior work on resilience (and recovery from attacks) can 
largely be categorized into approaches based on fault-tolerance 
algorithms in a Byzantine fault environment [4], and 
approaches based on moving target defense [1, 5]. While 
existing theories for fault tolerance (e.g., Byzantine fault 
tolerance) can guarantee resilience under certain conditions [6, 
7], in practice, these theories can break down in the face of an 
intelligent adversary. Further, it is difficult, in a dynamically 
evolving environment, to determine whether the necessary 
conditions for resilience have been met, resulting in difficulties 
in achieving provably resilient operation. In addition, existing 
theories often do not sufficiently take into account 
computational cost [8, 9] (adversary is assumed to have infinite 
resources and time), hierarchy of importance (all network 
resources are assumed to be equally important), and the 
dynamic nature of some attacks (i.e., as the attack evolves, can 
fault tolerance be maintained?). 

A number of other research developments may be of 
relevance. These include self-stabilizing systems [10, 11], 
distributed algorithms in systems with sectional faults [12], and 
self-organizing systems [13]. Each of these approaches has the 
potential to improve cyber-resilience. However, these theories 
will need to be augmented to account for an intelligent 
adversary. Conversely, game theory and other conflict models 
[14] bear on intelligent adversaries, but may not always 
account for faults. 

Figure 1. Elements necessary for resilience in cyber-systems 

Recent publications (such as [15]) indicate that a dynamic 
reconstitution and reconfiguration effort may be capable of 
addressing certain classes of attacks. However, the published 
information indicates a manual, "operator-in-the-loop" 
approach, where, once indicators of compromise are identified, 
operators at a Resilience Operations Control System and the 
Cyber Operations Center decide on an appropriate course of 
action and implement it. However, the test implementation of 
their framework for resiliency appears to have depended on 
commercial resiliency management systems, the market for 
which was fairly immature as of the writing of that article. 
Products in the R&D phase were stand-alone and were not easily 
integrable with enterprise-wide resiliency management 
systems, and commercially available systems were generally 
based on minor modifications to existing security products that 
did not meet the needs for resiliency. Further, the work does 
not appear to focus on asymmetry. Some prototypical products 
that have been discussed in the literature include the Network 
Maneuver [16]. 

This paper describes a state-space based formulation for 
use in reconstitution of cyber-systems. Specifically, the state $C_t$ 
of a cyber-system at time $t$ is a representation of its key 
properties. When a system is compromised, it moves from a 
fully operational state to one of several compromised states 
(Fig. 2) where it loses the capacity to maintain continuity of 
operations. The reconstitution effort is then one of moving the 
system back to one of several fully operational states possibly 
through several intermediate states. This process may be 
defined as one of optimization, with metrics for resilience and 
continuity of operations used to determine, at each time step 
subsequent to the compromise, the optimal state (such as 
network connectivity, services, and hosts) to ensure continuity 
of operations while improving resilience. 

Figure 2. Conceptual representation of reconstitution in cyber-systems as 
moving the system from one of several compromised states to a fully 
operational state 

IV. Mathematical Preliminaries 

We use $C_t$ to represent the state of a cyber-system at time $t$. 
In this paper, we assume that a graph $G_t$ may be used to 
represent the network connectivity information within the 
system at time $t$. The connectivity information may change 
over time as the cyber-system goes through the reconstitution 
process, and this dependency is captured explicitly. Further, we 
assume that a tensor $\phi_t$ can be used to represent the 
configuration at time $t$ of the different elements in the cyber- 
system. We define 

$$C_t = \{G_t, \phi_t\} \tag{1}$$



Note that, in general, a cyber-system is defined in 
continuous time; that is, connectivity and configuration 
information is present at all times. However, the state of the 
cyber-system will be assumed to be discrete; that is, it can take 
on only one of a finite number of values. With this constraint, 
the system $C_t$ is an example of a continuous-time, discrete-valued 
system [17]. 
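
As an illustration only, a discrete-valued state made up of connectivity and configuration information might be held in a small immutable data structure. The sketch below is an assumption, not part of the formulation: the names `CyberState`, `edges`, `config`, and `make_state` are hypothetical, the graph is stored as a set of undirected edges, and the configuration is stored as one tuple of discrete settings per node.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class CyberState:
    """Hypothetical container for a state: connectivity plus configuration."""
    # Undirected edges of the connectivity graph, each stored as a sorted (u, v) pair.
    edges: FrozenSet[Tuple[int, int]]
    # Configuration: one tuple of discrete settings per node.
    config: Tuple[Tuple[int, ...], ...]


def make_state(edges, config):
    """Normalize inputs so that equal states compare (and hash) equal."""
    norm_edges = frozenset((min(u, v), max(u, v)) for u, v in edges)
    norm_config = tuple(tuple(row) for row in config)
    return CyberState(norm_edges, norm_config)


# Two descriptions of the same three-node system yield the same discrete state.
a = make_state([(0, 1), (2, 1)], [[1, 0], [0, 1], [1, 1]])
b = make_state([(1, 0), (1, 2)], [[1, 0], [0, 1], [1, 1]])
```

Making the state hashable is a design choice that lets it serve directly as a node in a state-space search, which is how the reconstitution problem is framed later in the paper.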

Normal or abnormal traffic within the cyber-system is 
assumed to generate data $D_t$ that is a function of $C_t$ and some 
parameters $\Theta$: 

$$D_t = f(C_t, \Theta) \tag{2}$$



The parameters $\Theta$ might represent, for instance, the 
portions of the cyber-system where sensors are deployed. 
Alternatively, $\Theta$ might be used to represent a set of 
transformation parameters employed to extract relevant data 
from the cyber-system. In any case, (2) is generic enough to 
cover many possibilities. For the moment, we place no 
restrictions on whether this data is available for access from 
outside the system, or only within the system. We however 
assume that $D_t$ is a random process defined by a probability 
density function on $\Theta$: 

$$\Theta \sim P(\theta) \tag{3}$$

where $\theta$ represents the parameters of the density function. This 
assumption is simply a reflection of the fact that, in general, the 
only kinds of system-wide information that may be obtainable 
are statistical quantities such as, for instance, the mean and 
variance of network traffic flow patterns, and that minute-to- 
minute specification of data within a cyber-system is, in 
general, difficult to obtain. Note that this assumption is not 
restrictive: in cases where a deterministic description may be 
available, it may be accounted for by setting the corresponding 
probability to 1 (interpreted as complete knowledge). We may 
further assume that $D_t$ is ergodic in the mean (i.e., time 
averages may be used to approximate process means). This is a 
simplifying assumption, and will need to be validated using 
data from realistic cyber-systems. 
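
To make the ergodicity assumption concrete, the following minimal sketch estimates a process mean from a single time series. It assumes, purely for illustration, that the observed traffic is a stationary Gaussian sequence; the names `time_average`, `TRUE_MEAN`, and `traffic` and all numeric values are hypothetical.

```python
import random


def time_average(samples):
    """Average over time of one observed realization of the process."""
    return sum(samples) / len(samples)


# Assumed synthetic traffic: independent Gaussian samples around a fixed mean.
rng = random.Random(0)
TRUE_MEAN = 100.0
traffic = [rng.gauss(TRUE_MEAN, 10.0) for _ in range(10_000)]

# Under ergodicity in the mean, this time average approximates the process mean.
estimate = time_average(traffic)
```

For real cyber-systems the traffic may be non-stationary, in which case a time average of this kind would not be a valid estimate; that is the validation step the text calls for.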

Faults within the cyber-system are described within this 
formulation through a mathematical description of their effects. 
This allows for a framework that can capture both natural and 
adversarial faults, and enables the resulting reconstitution 
approach to be agnostic to specific attack vectors. We denote 
the fault sequence by the vector 



$$F = [F_1, F_2, \ldots, F_t, \ldots] \tag{4}$$

where $F_t$ denotes the fault at time $t$, and $F_k$, $1 \le k \le t-1$, denote 
the sequence of faults before time $t$; note that future faults (i.e., 
for times greater than t) may be incorporated into this 
framework. As noted earlier, these faults may be either natural 
or due to an intelligent adversary. 

Faults at any time instant may result in hardware level 
effects (such as loss of a server) and result in a change in the 
connectivity, and consequently, in changes in the configuration 
and data. Alternatively, faults may manifest themselves at only 
the configuration level (for instance, a change in the firewall 
settings on a specific server), or within the data available in the 
cyber-system (for instance, higher than normal numbers of 
open ports, or increased network traffic). Thus, we may assume 
that, at time $t$, fault $F_t$ defines a mapping of the form: 

$$F_t : C_t \to C_t \tag{5}$$



The resulting state is generally assumed to be a less- 
desirable state (from the perspective of maintaining mission- 
critical operations, and as defined by some metrics — see 
below). For simplicity, if no fault (natural or otherwise) exists 
in the system at a given time (say $t = k$), we denote $F_k = 0$. 
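
Treating a fault as a mapping from one state to another can be sketched directly as a function. The example below is an assumption for illustration: a state is a pair of (edge set, per-node configuration), and the helper names `remove_node`, `no_fault`, and `apply_faults` are hypothetical.

```python
def remove_node(node):
    """Build a fault that deletes a node and its links (a hardware-level effect)."""
    def fault(state):
        edges, config = state
        new_edges = frozenset(e for e in edges if node not in e)
        new_config = {n: c for n, c in config.items() if n != node}
        return (new_edges, new_config)
    return fault


def no_fault(state):
    """The 'no fault' case: the state is left unchanged."""
    return state


def apply_faults(state, faults):
    """Apply a fault sequence in time order, returning the final state."""
    for fault in faults:
        state = fault(state)
    return state


# Assumed three-node system; node 1 hosts an authentication service.
state0 = (frozenset({(0, 1), (1, 2), (0, 2)}), {0: "web", 1: "auth", 2: "db"})
state1 = apply_faults(state0, [no_fault, remove_node(1)])
```

Because only the effect on the state is modeled, the same function signature covers both a natural failure and an adversarial action, which is the attack-vector agnosticism the text describes.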

We use the representation above to define the needs for 
reconstitution. We do not claim that this representation is 
unique; as other representations of cyber-systems, data, and 
fault sequences become available, they may be easily incorporated 
in this work. 

V. Metrics for Continuity of Operations 

Critical to reconstitution is the ability to define continuity 
of operation. In this initial formulation, we assume that a metric 
exists that can be used for this purpose. Such a metric $M_t$ would 
need to be a function of the system state: 

$$M_t = s(C_t) \tag{6}$$



and must be computable from knowledge of the system 
network topology and configuration information. A number of 
resilience metrics related to continuity of operations are defined 
in the literature [18]. However, the bulk of these metrics are 
focused on system-level quantities (such as time to recover 
from an attack, percentage of available services, etc.). While 
these are important and help characterize the system 
performance, these are difficult to use for dynamic 
reconstitution, as computing such metrics in real-time (as the 
system is being reconstituted) from knowledge of only the 
configuration and/or connectivity is difficult. Instead, indirect 
metrics are necessary, and include graph metrics such as 
diameter, algebraic connectivity, average path length, 
clustering coefficient, although other graph statistics may be 
relevant and computable in real-time. In addition to graph 
metrics, metrics such as vulnerability scores [19] that may be 
computed from configuration information as well as a priori 
knowledge about attack vulnerabilities may also be applicable. 
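
As a hedged illustration of an indirect graph metric, the sketch below computes the algebraic connectivity (the second-smallest eigenvalue of the graph Laplacian) from an adjacency matrix. It assumes NumPy is available and that the graph is undirected; the function name and the three small test graphs are hypothetical.

```python
import numpy as np


def algebraic_connectivity(adj):
    """Second-smallest eigenvalue of the Laplacian L = D - A of an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    eigenvalues = np.linalg.eigvalsh(laplacian)
    return float(eigenvalues[1])


# Three assumed 3-node topologies: a path, a complete graph, and a disconnected graph.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
complete3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
split3 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]

values = [algebraic_connectivity(g) for g in (path3, complete3, split3)]
```

The value is zero exactly when the graph is disconnected and grows as the graph becomes better connected, which is why it can serve as a proxy for robustness that depends only on the topology.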

An important element that needs to be captured in the 
metric is the potential dependencies between critical services, 
and between the deployment of services and the underlying 
network topology. This results from the need to ensure that the 
impact of any fault or attack and the subsequent reconstitution 
is represented correctly. For instance, service A and B may 
depend on a higher-level service C (for example, an 
authentication service). The impact of a fault that results in C 
being unavailable will impact the availability of A and B, and 
therefore the ability to meet mission needs. 
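
The dependency example above can be sketched as a simple fixed-point computation. This is an illustrative assumption rather than the metric itself: `available_services` and the dependency map `deps` are hypothetical names, and a service is treated as available only if it is running and every service it depends on is available.

```python
def available_services(running, depends_on):
    """Return the running services whose dependencies are all available."""
    available = set(running)
    changed = True
    while changed:
        changed = False
        for service in list(available):
            needs = depends_on.get(service, ())
            if any(dep not in available for dep in needs):
                # A missing dependency makes this service unavailable too.
                available.remove(service)
                changed = True
    return available


# Services A and B both depend on C (for example, an authentication service).
deps = {"A": ["C"], "B": ["C"]}
before = available_services({"A", "B", "C"}, deps)
after = available_services({"A", "B"}, deps)  # C has been lost to a fault
```

Repeating the pass until nothing changes handles chains of dependencies, so the loss of one shared service propagates to everything that relies on it.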

VI. A Proposed Framework for Reconstitution 

We define reconstitution as determining the state $C_t$, $t > T_0$, 
that ensures continuity of operation. Here, $T_0$ is assumed to be 
the time at which the reconstitution effort is initiated. In 
general, the problem of determining secure configurations is 
NP-complete [19]. However, solutions that are "good-enough" 
may still be possible in a reasonable time-frame. 

One approach to formulating the concept of asymmetric 
resilience is by accounting for the cost to the defender [20]. 
Assume that the cost to the attacker can be represented by $R_a$ 
while the cost to the defender is represented by $R_d$. The cost $R_d$ 
is a dimensionless number that accounts for the infrastructure 
cost $R_{di}$ during the reconstitution effort (normalized to the 
initial cost incurred during the original system setup) as well as 
the risk (of continued attack or faults) associated with the 
configuration. Note that $R_{di}$ may be estimated from the 
connectivity graph as well as any redundancies required for 
system robustness. $R_a$ may consist of similar quantities; 
however, measuring $R_a$ is a difficult proposition. It is, however, 
reasonable to assume that the faster the reconstitution effort, 
the greater the cost to the attacker (both in terms of 
infrastructure cost to maintain an attack as well as in terms of risk 
of attribution). For this initial framework, we will assume that 
$R_a$ is inversely proportional to time spent in the reconstitution 
effort: 

$$R_a \propto \frac{1}{t - T_0}, \quad t > T_0. \tag{7}$$

A. Reconstitution as an Optimization Problem 

Given this information, one possible approach to 
reconstitution is to frame the problem as a multi-objective 
optimization problem, where asymmetric advantage may be 
introduced by, for instance, requiring that the solution 
minimizes the cost to the defender while maximizing the 
estimated cost to the attacker. Specifically, we want to 
maximize $M_t$ and $R_a$ while minimizing $R_d$ by adjusting the 
network connectivity as well as the configuration information. 
This may be represented as 

$$\max \; M_t, \; \frac{R_a}{R_d} \tag{8}$$

subject to constraints on graph connectivity, allowed 
configurations, and available resources. This framework allows 
for the incorporation of costs into the reconstitution effort, as 
well as attacks or faults during the reconstitution effort. 
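
One way to picture the multi-objective trade-off is to filter candidate reconstitution actions down to those that are not dominated in both objectives. The sketch below is an assumption for illustration: it uses the inverse-time model for the attacker cost with a hypothetical constant `k`, and the candidate names, metric values, and defender costs are invented. Pareto filtering is used here as a generic technique and is not claimed to be the method used in this work.

```python
def attacker_cost(t, t0, k=1.0):
    """Attacker cost assumed inversely proportional to elapsed time t - t0 (k is hypothetical)."""
    if t <= t0:
        raise ValueError("t must be later than the start of reconstitution")
    return k / (t - t0)


def pareto_front(candidates):
    """Keep (name, metric, ratio) candidates not dominated when both values are maximized."""
    front = []
    for name, m, r in candidates:
        dominated = any(
            m2 >= m and r2 >= r and (m2 > m or r2 > r)
            for _, m2, r2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


r_a = attacker_cost(t=5.0, t0=1.0)

# Invented options: (label, continuity metric, defender cost).
options = [("rebuild", 0.9, 2.0), ("patch", 0.6, 0.5), ("wait", 0.5, 1.0)]
candidates = [(name, m, r_a / r_d) for name, m, r_d in options]
front = pareto_front(candidates)
```

Here "wait" is dropped because "patch" is better on both the continuity metric and the cost ratio, while "rebuild" and "patch" remain as genuine trade-offs that a decision rule or operator would still have to choose between.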

Potential approaches to optimization that are applicable 
within this framework include genetic algorithms [19], linear 
programming, and dynamic programming [21]. The framework 
also encompasses both natural and adversarial faults. 

Challenges in this context include optimization and control 
of large-scale systems, particularly in the face of continuing attacks. A 
further challenge is presented by the distributed nature of 
cyber-systems, in part because a centralized command and 
control apparatus is unlikely to provide a robust framework for 
resilience, and the reconstitution process will need to be 
managed in a distributed manner. 

A further challenge is the ability to explicitly reduce cost to 
the defense. The cost to the defender, while notionally simple 
to calculate, is difficult to quantify during the course of 
optimization. This is because any change in the system state 
results in incurred costs to the defense; thus, the overall cost 
over the entire reconstitution process will be the sum of the 
individual costs at each step in the process. However, 
conventional optimization approaches focus on minimizing 
only the incremental cost over the next iteration in the 
optimization. Thus, tools for optimization that find the overall 
least-cost trajectory in state space are needed. Moving to a 
distributed approach to reconstitution adds further challenges in 
this regard. 
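
The gap between minimizing incremental cost and minimizing overall cost can be shown on a toy state graph. The sketch below is illustrative only: the states, transition costs, and function names are hypothetical, and Dijkstra's shortest-path algorithm is used as one standard way to find a least-cost trajectory when all step costs are non-negative.

```python
import heapq


def greedy_cost(graph, start, goal):
    """Total cost when the cheapest next step is always taken."""
    total, state, seen = 0, start, {start}
    while state != goal:
        options = [(c, n) for n, c in graph[state].items() if n not in seen]
        if not options:
            return None  # dead end: no unvisited successor
        cost, state = min(options)
        seen.add(state)
        total += cost
    return total


def least_cost(graph, start, goal):
    """Total cost of the cheapest overall trajectory (Dijkstra's algorithm)."""
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if state == goal:
            return cost
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry
        for nxt, step in graph[state].items():
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return None


# Invented transition costs between a compromised state and an operational one.
graph = {
    "compromised": {"a": 1, "b": 2},
    "a": {"operational": 10},
    "b": {"operational": 1},
    "operational": {},
}
stepwise = greedy_cost(graph, "compromised", "operational")
overall = least_cost(graph, "compromised", "operational")
```

In this example the step-by-step choice commits to the cheap first move and pays 11 in total, whereas the trajectory search pays 3. An exhaustive search of this kind is only feasible for small state spaces, which is part of the challenge described above.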

In general, the assumption in this formulation is that the 
cost functions and constraints define a non-convex problem, 
and the resulting solution will only be locally optimal (and 
dependent on the starting state). This is because, in the context 
of the problem being studied here, a globally optimal solution 
refers to a resilient state that is capable of providing continuity 
of operations regardless of the fault or attack vector. While this 
may be theoretically possible, practical bounds on resources 
and time to recovery will necessarily restrict the space of 
possible solutions that can be explored, and result in tradeoffs. 

A specific example of the formulation is for optimal 
allocation of services assuming that the underlying topology is 
constant. In this context, we have a set of n critical services 
$S_1, S_2, \ldots, S_n$ and assume that a reconstitution effort is complete 
only when all the $n$ services become operational. The resulting 
configuration may be represented by the optimal $\phi$. A number 
of service providers (e.g., computer servers) are assumed 
available, and connected in some topology. Resource 
constraints are assumed at each service provider, and restrict 
the number of possible services (defined by the subset 
$S_p = \{s_{p1}, s_{p2}, \ldots, s_{p\ell}\}$) that can be hosted on server $p$. We are also 
given a cost function $R_{dp}$ to make a service provider $p$ 
operational: $R_{dp} : S_p \to \mathbb{R}^+$. We define 

$$R_d = \sum_{p} R_{dp} \tag{9}$$



Finally, dependencies between services are incorporated 
through constraints on the network topology (to minimize 
delays) and the configuration $\phi$. 

Given this setup, the goal of reconstitution is then to 
determine a minimum cost set of providers (each running a 
subset of services) such that all critical services are operational. 
This is a re-statement of the set-cover problem [22]: Given a 
universe $U$ of $n$ elements $S_1, S_2, \ldots, S_n$, a collection $S$ of subsets of 
$U$, $S_p = \{s_{p1}, s_{p2}, \ldots, s_{p\ell}\}$, $p = 1, 2, \ldots, P$, and a cost function 
$c : S \to \mathbb{R}^+$, find a minimum cost subcollection of $S$ that covers all 
elements in U. This particular instantiation of the reconstitution 
problem may be addressed through applicable combinatorial 
optimization approaches, though the same challenges that 
apply to all optimization-based approaches to reconstitution 
apply here. 
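
As one example of an applicable combinatorial approach, the sketch below applies the standard greedy heuristic for weighted set cover, which repeatedly picks the provider with the lowest cost per newly covered service. This is an approximation and is not claimed to be the method used in this work; the provider names, service names, and costs are hypothetical.

```python
def greedy_set_cover(universe, providers, cost):
    """Pick providers greedily by cost per newly covered service."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for p, services in providers.items():
            gain = len(uncovered & services)
            if p in chosen or gain == 0:
                continue
            ratio = cost[p] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            raise ValueError("some services cannot be hosted by any provider")
        chosen.append(best)
        uncovered -= providers[best]
    return chosen


# Invented instance: four critical services and three candidate providers.
services = {"s1", "s2", "s3", "s4"}
providers = {
    "p1": {"s1", "s2"},
    "p2": {"s3", "s4"},
    "p3": {"s1", "s2", "s3", "s4"},
}
cost = {"p1": 2.0, "p2": 2.0, "p3": 5.0}

chosen = greedy_set_cover(services, providers, cost)
total_cost = sum(cost[p] for p in chosen)
```

The greedy rule does not guarantee a minimum-cost cover in general, so the result should be read as a "good-enough" solution of the kind discussed earlier in this section.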

VII. Results 

The proposed optimization-based approach to reconstitution 
was tested using a simple network representation and a genetic 
algorithm optimization. An $n$-node network with $m$ services is 
described by two matrices. An adjacency matrix, A, defines the 
connectivity between nodes in the network. Here, a node is 
some computational resource, such as a computer or server that 
can run any of the $m$ services. A configuration matrix, $\phi$, 
describes the configuration of each node with respect to the 
critical services. The configuration of node $i$ with respect to 
service $j$ consists of a triplet of binary values: is service $j$ 
loaded on node $i$, does the configuration of node $i$ support 
service $j$, and is service $j$ currently running on node $i$? 

To test the reconstitution, a random network is simulated 
(with random adjacency and configuration matrices). Several 
constraints are imposed on the network: 

• Each node can be connected to at most 10% of the total 
number of nodes in the network; 

• Each node can only run services that are loaded on the 
node and supported by the node configuration; 

• Each node can run at most three services; and 

• Each service can be run on at most two nodes. 

Hardware faults can occur, wherein a node is permanently 
removed from the network and its services are lost. The genetic 
algorithm attempts to reconstitute the network to restore critical 
services and improve network resilience, subject to the same 
constraints above. The genetic algorithm fitness function 
attempts to maximize network robustness, characterized by the 
algebraic connectivity [23], and the availability of critical 
services, while minimizing the cost to update the network. 
Costs are incurred by adding or removing network connections, 
loading services on nodes, and changing node configurations to 
support new services. 
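
A fitness function of the kind described above might look like the following sketch. It is an assumption, not the implementation used for the reported results: the weights, the scalar `change_cost` argument, and the function names are hypothetical, NumPy is assumed to be available, and the three binary values per node and service are stored as three separate boolean matrices.

```python
import numpy as np


def algebraic_connectivity(adj):
    """Second-smallest eigenvalue of the graph Laplacian L = D - A."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return float(np.linalg.eigvalsh(laplacian)[1])


def is_feasible(adj, loaded, supported, running):
    """Check the four simulation constraints listed above."""
    n = adj.shape[0]
    degree_ok = (10 * adj.sum(axis=1) <= n).all()        # at most 10% of nodes
    runnable_ok = not (running & ~(loaded & supported)).any()
    per_node_ok = (running.sum(axis=1) <= 3).all()       # at most 3 services per node
    per_service_ok = (running.sum(axis=0) <= 2).all()    # at most 2 nodes per service
    return bool(degree_ok and runnable_ok and per_node_ok and per_service_ok)


def fitness(adj, loaded, supported, running, change_cost, weights=(1.0, 1.0, 0.1)):
    """Reward robustness and service availability, penalize update cost."""
    if not is_feasible(adj, loaded, supported, running):
        return float("-inf")
    w_conn, w_serv, w_cost = weights  # hypothetical weights
    services_up = running.any(axis=0).mean()
    return w_conn * algebraic_connectivity(adj) + w_serv * services_up - w_cost * change_cost


# Assumed example: a 20-node ring with 4 services, 3 of them running on node 0.
n, m = 20, 4
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
loaded = np.zeros((n, m), dtype=bool)
loaded[0, :3] = True
supported = loaded.copy()
running = loaded.copy()
score = fitness(adj, loaded, supported, running, change_cost=2)
```

Returning negative infinity for infeasible candidates is one simple way to keep a genetic algorithm away from constraint violations; penalty terms or repair operators are common alternatives.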



In this initial implementation, the following assumptions 
are made: 

• Dependencies between services, or services and the 
underlying topology, are ignored. 

• Incrementally minimizing costs at each iteration is 
assumed to minimize overall cost. 

• All network nodes are assumed to be functionally 
equivalent. 

• Any node that fails is assumed to be permanently 
removed from the network. 

These assumptions will be relaxed as this formulation is 
developed further. 

The efficacy of this approach is demonstrated on a 30-node 
network with 50 critical services. Two cases are considered: 
first, a single fault occurs at time zero and the network is 
reconstituted; second, a series of faults occur, one at time zero 
and another as the network reconstitution is underway. Fig. 3 
shows an example of the reconstitution performance. The 
figure shows the average number of services restored, and the 
robustness of the network after the fault (as measured by the 
algebraic connectivity), over time (arbitrary units). Network 
services are seen to recover fairly quickly, while robustness also 
increases rapidly. The results are the average of 10 random 
networks with 30 initial nodes (29 after the fault occurs), and 
50 services. 

Figure 3. Reconstitution metrics for a 30-node network (mean result over 10 iterations). The horizontal axis 
represents time steps (arbitrary units) after a fault at time 0. The vertical axis 
shows the algebraic connectivity (blue, scale on left) and number of services 
operational (green, scale on right). The vertical bars represent the standard 
deviation over 10 random networks. 

VIII. Conclusions 

This paper presented preliminary research results towards 
developing a mathematical formulation for effective 
autonomous reconstitution of compromised cyber systems. The 
proposed formulation enables the application of classical 
optimization approaches to determine optimal choices for 
reconfiguring and improving the resilience of the cyber-system 
after one or more natural faults or adversarial attacks. 
Preliminary results of simulation studies indicated that the 
proposed approach may be viable for reconstitution of 
compromised cyber systems. Unlike existing techniques, the 
proposed approach is seen to be viable even when faults occur 
during the reconstitution process. However, the simulation 
studies to date included several assumptions that will need to 
be relaxed if the results are to be applicable more generally. 
The relaxation of these assumptions, and the evaluation of the 
proposed formulation under these more general conditions, will 
be the focus of future work. 